R and RStudio, what is the difference? Why do I need the two?
R is both a language and a program. However, working in the GUI that comes with R is not the most pleasant of experiences.
RStudio is an integrated development environment (IDE) for R. It is a user-friendly interface for the R programming language that makes writing and running code much easier and more intuitive. It also provides a host of tools to make your working life easier, including a console, a syntax-highlighting editor that supports direct code execution (basically, it colours your code and lets you select code to run on the fly), and tools for plotting, history, debugging and workspace management.
You do not need RStudio to code in R, but it makes life a lot easier.
Other text editors and IDEs like VS Code are also available.
RStudio settings are accessed through Global Options… from Tools in the menu bar.
There are various options that allow you to customise RStudio, including its appearance and session settings.
RStudio Projects are a way to organise your work in R. They help you manage your working directories, workspace and source documents efficiently. They allow you to work on each analytical task in a separate environment, isolated from your other work.
To create an RStudio Project, select the Project dropdown in the upper right-hand corner.
Select New Project…
Select New Directory
Select New Project
The R console can be used as a calculator.
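As a small illustration, typing lines like the following into the console prints each result straight away. The particular sums are illustrative choices, not taken from the original session.

```r
# Each line is evaluated as soon as you press Enter in the console
10 + 4   # addition
5 / 2    # division
8 * 5    # multiplication
```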
But that is not very interesting and won’t get us very far.
You can also assign values to variables using the assignment operator <-.
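A minimal sketch of assignment follows. The variable names are made up for illustration.

```r
# Store values in named variables with the assignment operator <-
apples <- 6
pears <- 4

# Variables can then be reused in later calculations
total_fruit <- apples + pears

# Typing a variable's name prints its value
total_fruit
```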
Whilst it is easy to execute code in the console, after a while you run into its limitations. That is where scripts come in.
To set up a new blank script, click on the little plus sign in the top left of your screen underneath File, then select R Script. (Alternatively, press Ctrl + Shift + N.)
This will open a new R script in the source panel of RStudio.
It’s good practice to put a name, a date and author at the top of scripts.
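At the fundamental level, R has six basic data types: logical, integer, double, character, complex and raw. This section covers the numeric (integer and double), logical and character types.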
Numeric values in R can be either integers (whole numbers) or doubles (floating-point numbers). In everyday use there is little practical distinction between the two, and R converts between them as needed.
Scientific notation can be used to declare a numeric value. 1e3 is the equivalent of 1000.
In some code you might see a number followed by an L, e.g. x <- 2L. This explicitly defines x as an integer.
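The difference between the two numeric types can be checked with typeof(). This is a short sketch with illustrative variable names.

```r
x <- 2L    # the L suffix makes x an integer
y <- 2     # without it, R stores a double
z <- 1e3   # scientific notation for 1000

typeof(x)  # "integer"
typeof(y)  # "double"
z == 1000  # TRUE

# Mixing the two types is fine: the result is a double
typeof(x + y)
```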
We have already seen some arithmetic operators. The following operators can be used on numeric types.
| Operator | Name | Example |
|---|---|---|
| + | Addition | x + y |
| - | Subtraction | x - y |
| * | Multiplication | x * y |
| / | Division | x / y |
| ^ | Exponent | x ^ y |
| %% | Modulus | x %% y |
| %/% | Integer (Floor) Division | x %/% y |
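The operators in the table can be tried out as follows. The values of x and y are arbitrary.

```r
x <- 7
y <- 2

x + y    # 9
x - y    # 5
x * y    # 14
x / y    # 3.5
x ^ y    # 49
x %% y   # 1, the remainder after dividing 7 by 2
x %/% y  # 3, the number of whole times 2 goes into 7
```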
Logical values in R are used to represent boolean data. They have three possible values: TRUE, FALSE and NA.
T and F can also be used as shorthand for TRUE and FALSE, but this is not advisable: T and F are ordinary variables and can be overwritten, whereas TRUE and FALSE cannot.
Logical operators are used to perform logical operations on boolean values. Here are some common logical operators in R:
| Operator | Name | Example |
|---|---|---|
| & | AND | x & y |
| \| | OR | x \| y |
| ! | NOT | !x |
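A short sketch of these operators in use, including how NA behaves and why the T shorthand is risky. The values are illustrative.

```r
x <- TRUE
y <- FALSE

x & y   # FALSE: both sides must be TRUE
x | y   # TRUE: at least one side is TRUE
!x      # FALSE: negation

# NA is returned only when the answer depends on the missing value
NA & TRUE   # NA
NA & FALSE  # FALSE, whatever the missing value is

# T is just a variable, so it can be reassigned by accident
T <- FALSE
isTRUE(T)   # FALSE
rm(T)       # remove the override so T means TRUE again
```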
Comparison operators are used to compare values and return a boolean value.
| Operator | Name | Example |
|---|---|---|
| == | Equal | x == y |
| != | Not Equal | x != y |
| > | Greater Than | x > y |
| < | Less Than | x < y |
| >= | Greater Than or Equal | x >= y |
| <= | Less Than or Equal | x <= y |
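The comparison operators can be sketched as follows, with arbitrary values for x and y.

```r
x <- 5
y <- 8

x == y  # FALSE
x != y  # TRUE
x > y   # FALSE
x < y   # TRUE
x >= y  # FALSE
x <= y  # TRUE

# Comparisons work element by element on longer vectors
above_x <- c(1, 5, 10) > x
above_x  # FALSE FALSE TRUE
```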
The comparison operators above cannot be used to test whether a value is NULL or NA. For example, x == NULL returns an empty logical vector and x != NA returns NA, so neither gives a usable TRUE or FALSE.
Use is.null() or is.na() instead.
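The difference is easiest to see side by side. This is a minimal sketch with illustrative variable names.

```r
x <- NA
y <- NULL

x == NA     # NA, not TRUE
is.na(x)    # TRUE

y == NULL   # logical(0), an empty result
is.null(y)  # TRUE

# is.na() also works element by element on vectors
scores <- c(4, NA, 7)
is.na(scores)  # FALSE TRUE FALSE
```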
The logical and comparison operators let you control which code gets executed. The most common control-flow constructs are if and if-else statements.
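A minimal sketch of an if-else chain follows. The temperature thresholds and variable names are invented for illustration.

```r
temperature <- 18

# Only the first branch whose condition is TRUE is run
if (temperature > 25) {
  description <- "hot"
} else if (temperature >= 15) {
  description <- "mild"
} else {
  description <- "cold"
}

description
```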
Characters in R are used to represent text data. They are surrounded by either single ' or double " quotes.
[1] "hello"
[1] "200"
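Values such as "hello" and "200" are both character strings, even though the second looks like a number. The sketch below, with illustrative variable names, shows how to check this and how to convert text to a number.

```r
greeting <- "hello"
number_as_text <- '200'   # single quotes work the same way

class(greeting)               # "character"
is.character(number_as_text)  # TRUE

# Text must be converted before it can be used in arithmetic
converted <- as.numeric(number_as_text)
converted + 1                 # 201
```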
We will see more about characters and strings in week 5.
Vectors are the simplest type of data structure in R. They can hold numeric, character, or logical data. All the elements of a vector must be of the same type.
The combine function c() is used to create a vector of two or more elements.
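A short sketch of creating and using vectors follows. The values are illustrative.

```r
ages <- c(34, 28, 45)
first_names <- c("Ann", "Bob", "Cat")

length(ages)  # 3
ages[2]       # 28, R counts positions from 1
ages * 2      # 68 56 90, applied to every element

# Mixing types forces everything to a single type, here character
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"
```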
Factors are a special type of vector used to handle and store categorical data. A factor can be ordered or unordered.
reviews <- c("good", "bad", "v bad", "bad", "good", "v good")
reviews_fct <- factor(reviews)
reviews_fct
[1] good   bad    v bad  bad    good   v good
Levels: bad good v bad v good
However, the levels are in alphabetical order rather than a logical one. We can specify the order using the levels argument.
reviews_fct_ordered <- factor(reviews, levels = c("v bad", "bad", "good", "v good"))
reviews_fct_ordered
[1] good   bad    v bad  bad    good   v good
Levels: v bad bad good v good
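Setting levels controls the order in which categories are stored and displayed. To make the factor formally ordered, so that categories can be compared, factor() also accepts ordered = TRUE. The sketch below uses its own variable names (rating_levels, reviews_ord), which are not part of the original example.

```r
reviews <- c("good", "bad", "v bad", "bad", "good", "v good")
rating_levels <- c("v bad", "bad", "good", "v good")

# ordered = TRUE tells R that the levels have a rank
reviews_ord <- factor(reviews, levels = rating_levels, ordered = TRUE)

is.ordered(reviews_ord)          # TRUE
reviews_ord[1] > reviews_ord[2]  # TRUE: "good" ranks above "bad"

# Counts are reported in level order
review_counts <- table(reviews_ord)
review_counts
```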
We will come back to factors when plotting (weeks 4 and 5).
Lists are collections of data that can be of different types.
Lists can also be nested within another list.
england <- list(
  capital = "London",
  population = 57112500,
  devolved = FALSE
)
scotland <- list(
  capital = "Edinburgh",
  population = 5447000,
  devolved = TRUE
)
wales <- list(
  capital = "Cardiff",
  population = 3132700,
  devolved = TRUE
)
northern_ireland <- list(
  capital = "Belfast",
  population = 1910500,
  devolved = TRUE
)
uk <- list(
  England = england,
  Scotland = scotland,
  Wales = wales,
  "Northern Ireland" = northern_ireland
)
str(uk)
List of 4
 $ England         :List of 3
  ..$ capital   : chr "London"
  ..$ population: num 57112500
  ..$ devolved  : logi FALSE
 $ Scotland        :List of 3
  ..$ capital   : chr "Edinburgh"
  ..$ population: num 5447000
  ..$ devolved  : logi TRUE
 $ Wales           :List of 3
  ..$ capital   : chr "Cardiff"
  ..$ population: num 3132700
  ..$ devolved  : logi TRUE
 $ Northern Ireland:List of 3
  ..$ capital   : chr "Belfast"
  ..$ population: num 1910500
  ..$ devolved  : logi TRUE
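Elements of a list, including a nested one, can be pulled out with $ or [[ ]]. The sketch below rebuilds a small part of the example so that it runs on its own. The nations list is a made-up name for illustration.

```r
wales <- list(
  capital = "Cardiff",
  population = 3132700,
  devolved = TRUE
)
nations <- list(Wales = wales)

names(wales)           # "capital" "population" "devolved"
wales$capital          # "Cardiff"
wales[["population"]]  # 3132700

# Chain the operators to reach into a nested list
nations$Wales$capital  # "Cardiff"
```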
Matrices are two-dimensional and arrays are multi-dimensional, homogeneous data structures. All elements in a matrix or array must be of the same type. They will not be used much in this series.
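For reference, a minimal sketch of creating a matrix and an array, using arbitrary values:

```r
# A matrix with 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2)
dim(m)   # 2 3
m[2, 3]  # 6, the value in row 2, column 3

# An array with three dimensions
a <- array(1:24, dim = c(2, 3, 4))
dim(a)   # 2 3 4
```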
Data frames are one of the most commonly used data structures in R for storing tabular data. Like matrices, they consist of rows and columns of data. Unlike matrices, the data type can differ between columns, although all values within a column must be of the same type.
Converting the UK data above into a data frame:
countries <- c("England", "Scotland", "Wales", "Northern Ireland")
capital <- c("London", "Edinburgh", "Cardiff", "Belfast")
population <- c(57112500, 5447000, 3132700, 1910500)
devolved <- c(FALSE, TRUE, TRUE, TRUE)
uk_df <- data.frame(countries, capital, population, devolved)
uk_df
         countries   capital population devolved
1          England    London   57112500    FALSE
2         Scotland Edinburgh    5447000     TRUE
3            Wales   Cardiff    3132700     TRUE
4 Northern Ireland   Belfast    1910500     TRUE
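Once the data is in a data frame, columns and rows can be inspected and filtered. The sketch below rebuilds a cut-down version of uk_df so that it runs on its own. The names devolved_countries and total_population are illustrative.

```r
uk_df <- data.frame(
  countries = c("England", "Scotland", "Wales", "Northern Ireland"),
  population = c(57112500, 5447000, 3132700, 1910500),
  devolved = c(FALSE, TRUE, TRUE, TRUE)
)

nrow(uk_df)          # 4 rows
ncol(uk_df)          # 3 columns
sapply(uk_df, class) # each column has its own type

# Select a column with $, and rows with a logical condition
devolved_countries <- uk_df[uk_df$devolved, "countries"]
total_population <- sum(uk_df$population)
```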
Functions in R are a fundamental part of programming in the language. They allow you to encapsulate code into reusable blocks, making your scripts more modular and easier to maintain.
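As a minimal sketch, the hypothetical function below converts a raw count into millions. The name to_millions and its digits argument are invented for illustration.

```r
# Define a function with one required argument and one optional argument
to_millions <- function(x, digits = 1) {
  round(x / 1e6, digits)
}

# Call it like any built-in function
to_millions(5447000)               # 5.4
to_millions(57112500, digits = 2)  # 57.11
```

Wrapping the calculation in a function means it is written once and can be reused with different inputs.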
Packages in R are collections of functions, data, and compiled code that extend the capabilities of R. They are generally created for specialised tasks.
Packages are available to download from a number of sources, with CRAN being the main repository. At the time of writing, there were 22,416 packages available on CRAN.
The install.packages() command is used to install a package from CRAN.
Once installed, a package needs to be loaded before its functions and data can be used. This is done with the library() command.
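The two steps look like this in practice. The install line is commented out because it needs an internet connection and only has to be run once per machine. The {tools} package is used for the runnable part simply because it ships with R.

```r
# Step 1: install from CRAN (run once)
# install.packages("tidyverse")

# Step 2: load the package in every session where it is needed
# library(tidyverse)

# The same pattern with a package that ships with R
library(tools)
extension <- file_ext("crop_yields.csv")  # hypothetical file name
extension

# A single function can also be called without loading the whole package
tools::file_ext("notes.txt")
```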
The tidyverse is a collection of R packages that are designed to make data analysis in R easier and more consistent. It includes packages for data manipulation, visualization, and more.
The tidyverse is made up of nine core packages, which are loaded when library(tidyverse) is called. Each package specialises in an area of data analysis or data science.
The advantage of tidyverse packages is that they share a design philosophy, common grammar and data structures.
CSV (comma-separated values) files are widely used for data storage and exchange due to their simplicity and compatibility with various software applications. Importing data from CSV files is therefore an important skill.
Base R provides the read.csv() function. However, we will look at the read_csv() function from the {readr} package, which is installed with the tidyverse, as it offers a number of advantages.
Rows: 139 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): Year, Wheat, Barley, Oats, Rye, Mixed Corn, Triticale, Oilseed Rape
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 8
Year Wheat Barley Oats Rye `Mixed Corn` Triticale `Oilseed Rape`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018 7.8 5.7 5 NA NA NA 3.4
2 2019 8.9 6.9 5.9 NA NA NA 3.3
3 2020 7 5.9 4.9 NA NA NA 2.7
4 2021 7.8 6.1 5.6 NA NA NA 3.2
5 2022 8.6 6.6 5.7 NA NA NA 3.7
6 2023 8.1 6.1 5 NA NA NA 3.1
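The general pattern is a read_csv() call followed by a look at the result. The sketch below assumes {readr} is installed and writes a tiny temporary file, reusing two rows of figures from the output above, so that it runs without the original data file. The names csv_path and yields are illustrative.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Year,Wheat,Barley",
    "2022,8.6,6.6",
    "2023,8.1,6.1"),
  csv_path
)

# show_col_types = FALSE silences the column specification message
yields <- read_csv(csv_path, show_col_types = FALSE)

# head() shows the first rows of the resulting tibble
head(yields)
```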
What if my data is spread across a number of CSV files? read_csv() can read several files in one call, provided they all share the same format.
Rows: 74278 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cod, port, coo
dbl (8): perref, type, comcode, sitc, mode, value, mass, supp_unit
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 11
perref type comcode sitc cod port coo mode value mass supp_unit
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 202301 1 22041011 11215 FR <NA> FR 0 22139627 767412 766616
2 202301 1 22041011 11215 FR ZZZ <NA> NA 1544890 25788 25894
3 202301 1 22041011 11215 FR LON FR 10 2591004 122650 122648
4 202301 1 22041011 11215 FR MED FR 60 93855 6615 6615
5 202301 1 22041011 11215 FR DOV FR 60 936092 47963 47994
6 202301 1 22041011 11215 FR LGP FR 10 4068 179 47
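A minimal sketch of reading several files at once, assuming {readr} version 2.0 or later is installed. The two temporary files and their contents are invented for illustration. The id argument adds a column recording which file each row came from.

```r
library(readr)

# Create two small CSV files with identical columns
jan_path <- tempfile(pattern = "trade_jan_", fileext = ".csv")
feb_path <- tempfile(pattern = "trade_feb_", fileext = ".csv")
writeLines(c("perref,cod,value", "202301,FR,100"), jan_path)
writeLines(c("perref,cod,value", "202302,FR,250"), feb_path)

# Pass a vector of paths; the rows are stacked into one tibble
trade <- read_csv(
  c(jan_path, feb_path),
  id = "source_file",
  show_col_types = FALSE
)

trade
```

In a real project, a vector of paths could be built with list.files() and a suitable pattern rather than typed out by hand.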